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Value  iteration  and  policy  iteration  are  two  well  known  computational  methods  for 
solving  Markov  renewal  decision  processes.  Value  iteration  converges  linearly,  while  policy 
iteration  (typically)  converges  quadratically  and  is  therefore  more  attractive  in  principle. 
However,  when  the  state  space  is  very  large  (or  continuous),  the  latter  asks  for  solving  at 
each  iteration  a  large  linear  system  (or  integral  equation)  and  becomes  unpractical. 

.  ...  J?'' 

We  propose  an  approximate  policy  iteration”method,  targeted  especially  to  systems 
with  continuous  or  large  state  spaces,  for  which  the  Bellman  (expected  cost-to-go)  func¬ 
tion  is  relatively  smooth  (or  piecewise  smooth).  These  systems  occur  quite  frequently  in 
practice.  The  method  is  based  on  an  approximation  of  the  Bellman  function  by  a  linear 
combination  of  an  a  priori  fixed  set  of  base  functions.  At  each  policy  iteration,  we  build 
a  linear  system  in  terms  of  the  coefficients  of  these  base  functions,  and  solve  this  system 
approximately.  We  give  special  attention  to  a  particular  case  of  finite  element  approxima¬ 
tion  where  the  Bellman  function  is  expressed  directly  as  a  convex  combination  of  its  values 
at  a  finite  set  of  grid  points.  ^ _ , 

In  the  first  part  of  the  paper,  we  survey  and  extend  slightly  some  basic  results  concern¬ 
ing  convergence,  approximation,  and  bounds.  All  along  the  paper,  we  consider  both  the 
discounted  and  average  cost  criteria.  Our  models  are  infinite  horizon  and  stationary. 


‘Departement  d’informatique,  Universite  Laval,  Ste-Foy,  Quebec,  Canada,  G1K  7P4. 


1.  Introduction 


When  solving  a  Markov  Renewal  Decision  Process  (MRDP)  with  large  (sometimes  contin¬ 
uous)  state  or  action  spaces  [17],  it  is  usually  necessary  to  use  some  sort  of  discretization 
or  approximation  procedure  [1,  2,  8,  11,  12,  13,  16,  18,  19,  22,  34,  35].  The  discretized  (or 
“smaller”)  problem  can  then  be  solved  using  either  linear  programming,  policy  iteration, 
value  iteration,  or  some  hybrid  combination  of  these  [2,  13,  18,  19,  26,  27,  28,  32,  34]. 

A  natural  approach  consists  in  partitioning  the  state  and  action  spaces  into  a  finite  class 
of  subsets,  selecting  a  representative  element  in  each  subset,  and  defining  and  approximate 
but  more  tractable  model,  with  finite  state  and  action  spaces.  This  form  of  approxima¬ 
tion  has  been  proposed  and  studied  theoretically  in  [1,  2,  11,  12,  35].  Typically,  for  the 
approximation  schemes  suggested  by  these  authors,  the  value  function  in  the  Dynamic 
Programming  (DP)  functional  equation  is  approximated  by  a  piecewise  constant  function, 
constant  on  each  subset  of  the  partition  of  the  state  space.  More  sophisticated  approachs 
like  polynomial  or  spline  approximation  or  interpolation,  finite  element  methods,  etc.,  were 
also  suggested  in  [8,  13,  22,  34]. 

Schweitzer  and  Seidmann  [34]  considered  a  MRDP  with  finite  state  and  action  sets  and 
suggested  polynomial  approximation  of  the  value  function.  They  proposed  three  computa¬ 
tional  algorithms:  linear  programming,  policy  iteration  with  least  squares  approximation, 
and  global  least  squares  fit. 

Daniel  [8]  suggested  spline  approximation  for  a  deterministic,  finite  horizon,  continuous 
control  problem.  Haurie  and  L’Ecuyer  [13]  considered  a  discounted  MRDP  model  and 
suggested  spline  or  finite  element  methods  to  approximate  the  value  function  at  each  step 
of  the  value  iteration  algorithm.  The  approximation  method  and/or  grid  may  vary  from 
iteration  to  iteration.  They  provide  formulas  to  compute  (or  estimate)  bounds  on  the 
optimal  value  function  and  on  the  value  function  associated  to  the  optimal  policy,  taking 
into  account  the  approximation  error  at  each  iteration  and  the  fact  that  only  a  finite 
number  of  iterations  are  made.  This  kind  of  approach  can  also  be  used  (heuristically)  with 
Schweitzer’s  algorithm  [32]  for  the  undiscounted  case  (see  [18,  19]  for  applications  and 
numerical  results).  Value  iteration  converges  geometrically  (at  a  linear  rate)  [4,  9,  32],  but 
often  with  a  factor  of  almost  one,  which  makes  it  rather  slow  in  terms  of  the  number  of 
iterations.  Policy  iteration,  on  the  other  hand,  converges  typically  at  a  quadratic  rate  [26, 
27],  and  is  thus  much  more  attractive  in  principle.  One  difficulty  is  that  when  the  state 
space  is  continuous,  or  has  large  cardinality,  each  policy  evaluation  step  asks  for  solving 
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either  an  integral  equation  or  a  huge  linear  system,  which  is  too  costly  or  impossible  to 
solve  in  practice.  This  is  the  main  reason  why  value  iteration  is  often  suggested  for  such 
cases  (see  for  instance  [13,  32])  despite  the  fact  that  it  is  sometimes  very  time  consuming, 
especially  when  each  iteration  involves  a  numerical  integration  at  each  evaluation  point 
[13,  18,  19]. 

In  this  paper,  we  consider  a  MRDP  model  with  general  (Borel)  state  and  action  spaces, 
as  introduced  in  [13,  17].  We  describe  a  finite  element  computational  approach  to  deal 
with  continuous  or  very  large  state  spaces.  The  general  idea  is  to  express  the  Bellman 
(expected  cost-to-go)  function  V  as  a  linear  combination  of  a  finite  number  of  (simple) 
base  functions  Bj, . . . ,  Bj: 

V(,)  (1) 

j= l 

We  replace  V  by  this  linear  combination  (with  unknown  coefficients  d3)  into  the  DP  func¬ 
tional  equation  that  corresponds  to  a  fixed  policy  (equation  (38)  below). 

A  direct  generalization  of  the  approach  proposed  in  [34]  is  to  integrate  this  equation 
(numerically  or  symbolically)  for  a  finite  number  of  states,  say  oj, . . .  ,07.  If  /  =  J,  we 
obtain  from  the  functional  equation  a  linear  system  in  terms  of  the  coefficients 
If  l  >  J  (more  evaluation  points  than  unknowns),  we  can  use  least  squares  fitting  to 
determine  the  dj' s:  again,  to  minimize  the  quadratic  form,  a  J-dimensional  linear  system 
must  be  solved.  Schweitzer  and  Seidmann  [34]  proposed  this  approach  for  the  case  of  finite 
state  and  action  spaces,  and  where  5  =  {<7i, . . . ,  07}.  The  linear  system  is  not  easy  to 
solve  in  general  for  large  values  of  J. 

A  second  approach  is  to  use  a  finite  element  method  [14].  We  first  define  a  scalar 
product  on  the  space  of  real-valued  functions  of  the  state  s.  Then,  the  basic  idea  is  that 
after  replacing  V  by  (1)  in  the  DP  equation  (38),  we  ask  the  scalar  product  of  this  equation 
by  any  arbitrary  function  of  the  form  (1)  to  be  a  valid  equation.  This  yields  a  system  of  J 
equations  in  J  unknowns.  In  a  special  case  of  this  approach,  F(s)  turns  out  to  be  expressed 
directly  as  a  convex  combination  of  the  values  of  V  at  J  evaluation  points 

V{>)  =  Y.VWi)BM  (2) 

J=! 

where  0  <  B&)  <  1,  Bj(<n)  =  Sij  (the  Kronecker’s  delta),  and  J2j  Bj(s)  =  1  for  each  s. 
The  cr j’s  are  in  fact  the  nodes  of  the  finite  elements.  When  inserting  this  V  into  the  DP 
equation  (38),  one  obtains  a  linear  system  whose  matrix  is  srbs*ochastic,  and  where  the 
unknowns  are  V(<t  1), . . . ,  V(o-j).  In  this  case  the  system  can  be  partially  solved  using  just 
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a  few  iterations  of  an  iterative  method,  starting  from  the  previous  value  of  V.  Another 
useful  property  of  this  scheme  is  that  if  we  change  the  set  of  evaluation  points  <Jj  and 
functions  Bj  after  a  number  of  iterations,  it  is  easy  to  compute  the  new  values  of  V (a^ ) 
from  the  old  ones,  and  continue  the  iterations  with  them. 

Our  suggested  algorithm  with  the  finite  element  approach  can  be  viewed  as  an  exten¬ 
sion  of  the  “modified  policy  iteration”  algorithm  studied  by  Puterman  and  Shin  [26]  for 
discounted  Markovian  decision  process  (MDP)  models.  We  start  with  some  policy  p,  and 
iterate  over  the  following  steps:  partition  (triangulate)  the  state  space  into  finite  elements, 
with  nodes  cr\  ,  ...,<7j  and  base  functions  B\, . . .  ,Bj,  construct  the  linear  system  associ¬ 
ated  with  this  set  and  the  policy  fi,  solve  this  system  approximately  for  the  V(<7j)’s,  put 
this  solution  back  into  the  DP  functional  equation,  perform  a  minimization  step  to  obtain 
a  new  policy  /i,  and  repeat.  To  solve  the  system  approximately,  one  can  just  perform  a 
few  iterations  of  an  iterative  method,  possibly  combined  with  some  aggregation  steps  [4, 
5].  The  “extension”  lies  mostly  in  the  way  that  we  approximate  the  Bellman  function  to 
treat  the  case  of  continuous  state  spaces.  We  also  consider  the  undiscounted  case. 

Aggregation-disaggregation  techniques  have  been  proposed  for  accelerating  convergence 
in  MRDPs.  See  [4,  5]  and  the  references  cited  there.  Basically,  the  state  space  is  parti¬ 
tioned  into  a  finite  number  of  subsets,  which  form  the  “states”  of  the  aggregate  process. 
In  practice,  defining  the  partition  is  usually  difficult.  Bertsekas  and  Castanon  [5]  have  in¬ 
troduced  an  interesting  aggregation  approach  in  which  the  partition  is  modified  adaptively 
based  on  the  progress  of  the  algorithm.  These  aggregation  methods  have  been  proposed  for 
dealing  with  large,  but  finite  state  and  action  spaces.  In  contrast,  the  methods  described 
in  this  paper  are  targeted  primarily  towards  continuous  state  spaces.  In  fact,  they  can  be 
combined  with  aggregation  methods. 

We  now  give  an  outline  of  the  paper.  In  section  2,  we  state  the  two  basic  MRDP 
models,  one  with  a  discounted  cost  criterion,  the  other  with  an  average  cost  criterion, 
and  we  give  their  associated  DP  functional  equations.  We  also  give  formulas  to  compute 
(or  estimate)  bounds  on  the  optimal  value  function,  and  on  the  value  function  associated 
with  the  retained  policy.  For  the  discounted  case,  we  give  results  for  a  N- stage  locally 
contracting  model  that  generalizes  the  model  considered  in  [13,  17]  (which  was  one-stage 
locally  contracting).  We  also  provide  bounds  for  the  average  cost  case. 

In  section  3,  we  recall  the  usual  value  iteration  and  policy  iteration  algorithms,  with 
some  comments.  In  sections  we  describe  a  finite  element  computational  approach,  using 
an  approximate  policy  iteration  algorithm.  We  consider  both  the  discounted  and  average 
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cost  criteria.  We  focus  on  the  basic  practical  ideas.  We  do  not  perform  a  formal  complexity 
analysis.  Such  an  analysis  is  non-trivial  and  would  require  introducing  other  technical 
conditions  (e.g.,  for  smoothness)  on  the  model.  In  the  conclusion,  we  comment  on  further 
possible  variations  and  we  mention  some  experience  with  numerical  examples. 
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2.  Two  MRDP  models  and  their  functional  equations 


Consider  the  following  (stationary)  Markov  Renewal  Decision  Process  (MRDP)  model  [13]. 
The  state  space  S  and  action  space  A  are  Borel  spaces.  Each  state  s  in  S  has  a  nonempty 
set  of  admissible  actions  A(s).  At  each  of  an  infinite  sequence  of  stages  (events),  the 
decision  maker  observes  the  state  s  and  selects  an  action  a  (also  called  a  decision)  from 
A(s).  Let  0  =  to  <  t\  <••'<  tn  <  <n+i  <  •  •  •  such  that  in  is  the  time  of  occurence  of 
stage  n,  and  let  sn  and  an  be  the  state  and  selected  action  at  that  stage.  A  cost  c(sn,an )  is 
incurred  at  stage  n,  and  the  next  state  sn+1  and  time  of  the  next  stage  tn+\  are  determined 
as 

(^n+Hsn+l)  =  (tn  +  Cia) 

where  the  pair  (£,  s)  is  generated  randomly  according  to  a  probability  measure  Q(-  |  sn,  an ) 
over  [0,oo)  x  S.  A  new  action  an+ 1  is  selected  from  A(sn+i),  and  so  on.  As  in  [3,  13,  17],  we 
assume  that  T  =  {(s,a)  j  s  £  S,a  £  A(s)}  is  an  analytic  subset  of  S  x  A,  that  c  is  a  lower 
semi-analytic  function,  and  that  Q  is  a  Borel  measurable  stochastic  kernel  on  [0,  oo)  x  5 
given  S  x  A.  This  is  less  restrictive  than  the  lower  semi-continuity  assumptions  that  are 
often  made,  and  which  do  not  always  hold  in  practice.  But  we  must  consider  more  general 
policies  than  Borel-measurable. 

A  policy  is  a  universally  measurable  function  p  :  S  — »  A  such  that  p(s)  £  A(s)  for  each 
s  in  S.  Let  U  be  the  set  of  all  policies.  In  this  paper,  we  consider  only  nonrandomized 
stationary  Markov  policies.  Associated  with  any  initial  state  s0  =  s,  and  policy  p,  there  is  a 
uniquely  defined  probability  measure  on  the  set  of  infinite  sequences  (s0>  ^o,  Sj ,  ax , . . .), 
where  an  =  p(sn)  for  each  n.  Let  be  the  corresponding  mathematical  expectation. 


2.1.  The  discounted  model 


Here,  we  assume  that  the  costs  are  discounted  at  rate  p  >  0.  Hence,  a  cost  of  1  incurred 
in  t  units  of  time  is  equivalent  to  a  cost  of  e~pt  incurred  now.  For  each  policy  p  and  initial 
state  so  =  s,  we  introduce,  when  they  exists,  the  values 


and 


V„(s)  =  lim  E „ 

n—+oo  ^ 


n— 1 


p<,c(st,at) 


li=0 


F.(s)  =  infV;(s). 


(3) 

(4) 
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The  functions  VM  and  V.  represent  respectively  the  total  expected  discounted  cost  associ¬ 
ated  with  policy  /z,  and  the  optimal  total  expected  discounted  cost.  A  policy  /z  e  U  is  said 
to  be  e-optimal,  for  e  >  0,  if  VM(s)  <  V,(s)  +  e  for  all  s  6  5. 

For  any  integer  n  >  1,  the  expected  n-stage  discount  factor  associated  with  policy  /z 
and  state  s  is 

[e_p‘n] 

and  satisfies  0  <  an(fi,s)  <  1. 

Under  condition  C  or  LC  below,  this  model  admits  of  analysis  via  contraction  mappings. 

CONDITION  C  (JV-stage  contracting  model):  There  exists  an  integer  N  >  1,  and  real 
numbers  <  1,  cq  <  0  and  Cx  >  0  such  that  for  all  admissible  policies  fx, 

aN(fi,s)  <  ax 

cq  <  c(s,a)  <  Cx.| 

A  policy  /z  is  called  N-stage  distinguished ,  for  an  integer  N  >  1,  if  there  exists  three 
constants  <  1,  c0  and  cx  such  that  for  all  s  in  5  : 

co  <  c(s,/z(s))  <  Cj. 


CONDITION  LC  (iV-stage  locally  contracting  model):  There  exists  a  .N-stage  distin¬ 
guished  policy  jx  and  two  constants  K\  and  K2  such  that 


K\  -j-  K2  >  0 


and  for  every  policy  /z,  all  s  £  S,  and  every  integer  n  such  that  N  <  n  <  2 N, 


K\  -f  K2an(ix,  s)  < 


E 


e  pt'c(s„a,)  .| 

. 


(5) 

(6) 


Let  IB  be  the  set  of  all  extended  real-valued  functions  V  :  5  — »  [— oo,  oo],  endowed  with 
the  supremum  norm  |jV||  =  supg€5  |V(s)|,  and  IBo  be  the  Banach  space  of  all  bounded 
functions  in  IB.  An  operator  <j>  mapping  a  closed  subset  of  IBo  into  itself  is  said  to  be 
contracting  with  factor  a  if  a  <  1  and  |J^( V^)  —  <£(Vi)||  5:  all^2  —  Vi||  for  all  Vi  and  V2  in 
that  subset. 
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Defined  below  are  three  standard  dynamic  programming  operators.  For  V  £  IB,  s  6  S 
|  and  a  6  A(s),  let  (when  the  integral  exists): 

H(V)(s,a)  =  c(s,a)+[  V(s>-*Q(dC  *  ds'  |  s,a)  (7) 

J[0,oo)xS 

T(V)(.)  =  inf  mV)(,,a).  (8) 

aeX(*) 

For  every  policy  /i,  let 

Ttt(V)(s)  =  H(V)(syti(s)).  (9) 

Let  IBj  be  the  subset  of  universally  measurable  and  lower  semi-analytic  functions  in  IBo. 
The  operators  H ,  T  and  are  well  defined  on  IBi  (the  integrals  exist),  and  their  image  is 
also  in  IBx  (see  e.g.  [3,  17]).  Let 

=  |v  G  IBi  ]  <V<  -^i-1  (10) 

for  the  iV-stage  contracting  model,  and 

B.  =  |v  €  Bx  |  Kx  +  min(0,fir2)  <  V  <  (11) 

for  the  iV-stage  locally  contracting  model.  The  following  properties  are  proven  in  [15,  17] 
for  IV  =  1.  The  proofs  can  be  generalized  to  IV  >  1  using  the  same  reasoning  as  in  reference 
[6]- 

THEOREM  1.  Let  V  €  IB..  Under  assumption  C  or  LC: 


(a)  Let  fi  £U.  Under  LC,  suppose  also  that  T^N(Ki+xmn(0,K2))  <  (2N -l)ci/(l-£i) 


for  any  k  >  1.  Then, 

TU(V)  =  V 

(12) 

if  and  only  if  V  =  V^,  and 

hjn||r;(v)-FMi!  =  o. 

(13) 

(b)  IB.  is  closed  under  Tn  for  every  n  >  N, 

T(V)  =  V 

(14) 

if  and  only  if  V  =  V-.,  and 

hjnimn-v.n  =  o.i 

(15) 

From  an  approximate  solution  to  the  functional  equation  (14)  and  the  following  theo¬ 
rem,  one  can  obtain  bounds  on  V .  and  on  the  suboptimality  of  the  retained  policy  (provided 
it  satisfies  a  contraction  condition). 
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THEOREM  2.  We  assume  condition  C  or  condition  LC.  Under  condition  C,  let 
and  N  satisfy  that  condition,  and  no  =  N.  Under  condition  LC,  let  €  (0, 1),  ko  be  the 
smallest  integer  larger  than 

Ki  —  min(0,.&2) 

)«i 


(2jy-i)Cl/(i-f1)- 

(Kt+K2 


and  no  =  Nko.  Let  V,  W  6  IB,,  S~  >  0  and  S+  >  0  such  that 


-6~  <  T(V)  -W  <6+ . 


(17) 


For  any  real  number  x,  let 

4>+(x)  —  supmax(0,  W(s)  —  V(s)  -f  x), 

»€S 

4>~{x)  =  supmax(0,  F(s)  —  W(s)  +  x). 
a£S 

Define 

e+  =  no^+  +  (no  -  l)<£+(0), 
e~  —  n0S~  +  (n0  -  1)^"(0). 

Then, 

-  €“  -  <V,  —  W<€+  +  —— — <f>+(e+).  (18) 

1  —  Oi  1  —  aj 

Moreover,  for  any  >  5+,  since  TM(F)  <  T(V)  +  6o  —  S+,  there  exists  a  policy  fi  such  that 

TAV)  <  W  +  So,  (19) 

and  if  T™  is  contracting  with  modulus  a  <  1  for  some  integer  M  >  1,  and  e0  is  defined  as 

€0  =  MS0  +  (M-  l)^+(0), 


then 

0  <  —  V,  <  e  +  - - (e  )  +  Co  +  - - ^+(eo).  (20) 

1  —  c*i  1  —  a 

PROOF.  The  derivation  of  (18)  is  done  exactly  as  in  the  proof  of  theorem  3.1  (a) 
in  [13].  The  remainder  of  the  proof  can  also  be  done  along  the  lines  of  theorem  3.1  (b) 
in  [13],  except  that  B2  is  replaced  by  IB.,  g2  =  (2 N  —  l)ci/(l  —  Si)  +  80  —  S+  +  ||V||, 
V  =  (Mg2  +  (1  +  a)||U||)/(l  —  a),  equations  (A. 12)  and  (A. 13)  are  replaced  by 

T,(V)(s)>c(s,n(s))-\\V\\  Vs  €5 
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and 


c(*,/x(a))  <  (2 N  -  l)Cl/(l  -  6,)  +  60-6+  +  \\V\\  -  g2 

respectively,  and  after  equation  (A.13),  TM,  g2  and  Vi  are  respectively  replaced  by  Tjf , 
Mg2  and  W.  To  get  the  last  inequality  in  the  proof,  we  use  the  fact  that  T^(V)  <W  +  e o, 
which  can  be  shown  by  the  same  argument  as  in  the  proof  of  (A.9)  in  [13].  I 

In  theorem  2,  the  choice  of  a\  is  left  open.  Note  that  tj,  no,  e-  and  e+  are  0(1/0!). 
<£+(e+)  and  are  also  0{\/a.\)  for  e~  and  e+  large  enough  (and  do  not  depend  on  aj 

otherwise).  Hence,  the  right-hand  side  of  (20)  is  0{\/ct\  +  1/(1  —  <*i)),  and  therefore,  ai 
should  be  kept  away  from  0  or  1.  To  keep  things  simple,  taking  o-i  =  0.5  is  not  a  bad  idea 
in  general.  A  more  sophisticated  approach  is  to  minimize  the  upper  bound  in  (20)  with 
respect  to  Oi\.  Obviously,  this  involves  more  computation.  The  best  thing  to  do  depends 
on  how  much  one  is  willing  to  pay  to  get  a  (possibly)  better  bound  for  the  current  solution. 

One  particular  case  of  the  above  theorem  is  when  T(V)  can  be  used  directly  for  W.  In 
that  case,  we  have  6~  —  8+  =  0.  Under  condition  C  with  N  =  1,  assuming  that  W  =  T(V), 
the  bound  in  (20)  becomes 

VM  -  V.  <  So  +  (r  (0)  +  4>+(80))  .  (21) 

Note  that  tighter  bounds  can  also  be  obtained  for  that  particular  case  as  in  Porteus  [24]. 


2.2.  The  average  cost  case 

We  define  the  average  expected  cost  under  policy  ft,  starting  from  state  so  =  s,  by  (when 
this  expression  exists): 


^>M(a)  =  limsup 


E/i,s 


£?=o  c(*i»«i) 


The  optimal  average  cost  is 


(22) 


(23) 


For  c  >  0,  a  policy  /lx  is  said  to  be  e-optimal  if  Vv(a)  —  V’*('s)  +  c  f°r  all  s  £  S. 

We  now  give  conditions  under  which  the  above  expressions  are  well  defined.  The 
condition  C  below  is  equivalent  to  the  one  formulated  in  the  previous  subsection  for  the 
discounted  case,  with  N  =  1.  Let 


t(s,a)=f  CWC  x  5  |  s,a). 

J[0,<x>) 
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It  represents  the  expected  time  to  the  next  transition  if  the  current  state  is  s  and  action 
a  is  chosen.  We  suppose  that  t  is  Borel-measurable  in  both  s  and  a. 

CONDITION  Cl.  There  exists  constants  r  >  0,  Co  and  cj  such  that  for  every  s  t  S 
and  a  6  ./l(.s), 


r  <  i(s,a), 

Co  <  c(s,a)  <  ci. 


Condition  Cl  means  that  the  cost  per  stage  is  bounded,  and  that  the  expected  duration 
of  a  stage  is  bounded  away  from  zero,  uniformly  over  the  state — action  pairs.  Under 
condition  Cl,  the  process  can  be  uniformized  as  suggested  in  [23,  32],  by  replacing  c,  t  and 
Q  respectively  by: 


c(s,a)  = 


rc(s,a) 

<(s,a) 


t(s,a)  =  t 

Q(S  |  s,  a)  =  -^-Q((0,oo)xS|i,a)+  (l  - v?(s,  5) 
<(s,a)  \  H5>a)/ 

for  any  j  £  S,  a  £  d(s)  and  5  Borel  subset  of  5,  where 

(  c\ _  f  f  ^  €  5; 

^  S’  |  0  otherwise. 


For  each  policy  /z  and  initial  state  s,  let  and  E M>J  be  the  probability  measure  and  corre¬ 
sponding  mathematical  expectation  associated  with  policy  /z  for  the  uniformized  process. 
The  effect  of  the  uniformization  is  to  force  a  transition  every  r  units  of  time,  yielding  a 
discrete-tiiae  process  equivalent  to  the  original  one.  In  fact,  VvC*)  can  be  rewritten  as 


^(s)  =  limsup  —  J2Mi, 

n— *oo  TIT 


n— 1 

yi  c(s,’,  o>i) 

Lt=0 


(24) 


Most  of  the  tools  developped  for  discrete-time  average  cost  dynamic  programming  models 
(see  [4],  chap.  7)  can  then  be  adapted  to  the  model  studied  here.  The  dynamic  program¬ 
ming  operators  can  be  defined  as  in  [4]  for  the  uniformized  model,  and  then  rewritten  in 
terms  of  the  original  model,  yielding  the  operators  defined  below. 


Let  i  €  S  be  some  (arbitrary)  fixed  reference  state.  Let  EL  be  the  subspace  of  univer¬ 
sally  measurable  and  lower  semi-analytic  functions  V  in  Bq  such  that  V (5)  =  0.  We  define 
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} 


the  following  dynamic  programming  operators:  for  V  E  IBi,  s  E  S,  and  a  E  A(s),  let 

B(V)(s,a)  =  c(s,a)+  [  V(s')Q(ds' \s,a)-V(s) 

J  s 

=  (<>.<>)  -V(‘)+ fsV(l')Q([0,ao)  X  d»'  I  s,a))  (25) 

J(V)W  =  inf  B(V)(„,o)  (26) 

a€A(s) 

T(V)(s)  =  V(»)  +  J(V)(3)  -  J(V)(5).  (27) 


For  every  policy  /x,  let 

X.(V)W  =  B(F)(., ,.(»))  (28) 

r,(V)(.)  =  vw  +  J,(V)(.)  -  J,(v)(i).  (29) 

Note  that  T(V )  and  TU(V)  are  also  elements  of  IBj,  while  B(V ),  J(F)  and  JM(V )  are 
universally  measurable  and  lower  semi- analytic  elements  of  1B0.  For  V  E  IBi,  V  =  0  means 
F(s)  =  0  for  all  s  E  S. 


The  relative  value  iteration  (or  successive  approximation)  algorithm  for  the  average 
cost  case  consists  in  choosing  an  initial  Vo  E  JBi,  and  defining  recursively  Vn  =  T(Vrn_]). 
In  the  following  condition  C2,  we  assume  that  this  algorithm  converges  to  a  fixed  point  of 
the  operator  T. 

CONDITION  C2.  For  any  V0  E  Blt  if  Vn  =  T(Vn^)ioT  n  =  1,2, . . then  lim,,^  \\Vn- 
V. II  =0  for  some  V.  E  IBi  solution  of  the  functional  equation: 


T(V)  =  V,  (30) 

and  lim^—oo  J(Vn)(i)  =  J(V.)(s).  | 

This  condition  might  seem  difficult  to  verify,  but  as  indicated  by  the  following  lemma, 
this  should  not  be  a  burden  for  most  practical  applications.  In  most  cases,  indeed,  we 
know  intuitively  that  the  optimal  average  cost  should  be  independent  of  the  initial  state, 
and  all  computations  in  practice  are  done  on  “finite  state”  computers. 

LEMMA  3.  Suppose  that  the  state  and  action  spaces  are  finite.  Then, 

(a)  The  following  are  equivalent:  (1)  C2  holds;  (2)  ipm(s)  is  independent  of  s;  (3) 
T(V)  =  V  for  some  V  E  IBi- 

(b)  If  all  states  communicate  (are  “weakly  connected”  in  the  sense  of  [23]),  i.e.  if  for 
each  pair  s,s'  in  5,  there  is  a  policy  p.  and  an  integer  n  >  0  such  that  PMi5[sn  =  A  > 
then  C2  holds. 


U 


PROOF.  See  Schweitzer  [33]  and  Platzman  [23].  I 

In  the  general  case,  condition  C2  also  implies  that  the  optimal  average  cost  is  indepen¬ 
dent  of  the  initial  state.  This  is  stated  in  the  following  theorem.  On  the  other  hand,  there 
might  be  policies  having  diffeient  average  costs  for  different  initial  states  (even  for  finite 
state  and  action  spaces). 

THEOREM  4.  Under  condition  Cl  above,  if  there  exists  a  bounded  function  K  G  Bi 
such  that  T(V,)  =  V.,  then  J(Vm)  is  constant  and  the  optimal  average  cost  is  given  by 
gm  =  J(V.)(s)/t ,  independently  of  3  or  the  initial  state.  Also,  any  policy  p.  for  which 
=  Vm  is  optimal. 

PROOF.  From  the  definition  of  T ,  T(Vm)  —  V.  is  equivalent  to  J(Vr,)(s)  =  J(K)(-s)  for 
all  s  G  S,  which  is  equivalent  to 

inf  (c(s,a)  -  V,(s)  +  f  V.(s')Q(ds'  |  s,  a)  -  rg =  0,  (31) 

a€-4(»)  \  J S  / 

where  gm  —  J(K)(5)/r.  Equation  (31)  corresponds  to  equation  (3)  in  [30],  and  the  remain¬ 
der  of  the  proof  goes  like  the  proof  of  theorem  2  in  [30].  | 

THEOREM  5.  Under  conditions  Cl  and  C2,  let  V  G  IBi  and  g,8~,8+  in  1R  such  that 


6-  <  J(V)(*)  <  8+ 

(32) 

for  all  s  G  S.  Then, 

8~  <  gm/r  <  8+. 

(33) 

Furthermore,  if  p  is  also  a  policy  such  that 

j„(V)(>)  <  «+ 

(34) 

for  all  s  G  S,  then 

(35) 

for  all  s  G  5,  and  p  is  (£+  —  8  )/r-optimal. 


PROOF.  Let  V0  =  V  and  for  n  =  1,2,...,  let  Vn  =  T(Vn_i )  =  Tn(Vo),  7„  = 
infs€s  J(Vn){s)  and  7„  =  sup,6S  J(V'n)(s).  Let  e  >  0,  s  G  5  and  a  G  A(s)  such  that 

*(V„)(*,a)<  J(Vn)(3)  +  e. 
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Then,  we  have 


J(Vn)(s)  >  B(Vn)(s,a)-e 

=  c(s,a)  +  /  Vn (s')Q(ds'  |  a,  a)  -  Vn(s)  -  e 
J  s 

=  c(s,a)  +  js  +  J(Vn_,)(s'))  Q(ds’  I  5,  a) 

-J(V'n_1)(i)  -  V„(s)  -  c. 


On  the  other  hand, 


Vn(s)  =  T(Vn_i)(s) 

<  Vn.l(s)+B(Vn.1)(s,a)-J(Vn.l)(S) 

=  c(s,a)+  [  K-1(*OQ(dj,|s,«)-J(rB_i)(i). 
J  s 


Combining  these  two  inequalities,  we  obtain 


/(K)(*)  >  J  J(Vn_a)(a')Q(ds' !«,«)-  e. 


Since  this  holds  for  any  3  C  S  and  e  >  0,  we  obtain 

7n  =  inf  J(V„)M  >  inf  J(K-.)W  =  7n-i- 

By  a  similar  argument,  we  can  also  show  that  7„  <  7„_j.  From  condition  C2  and  theorem 
4,  we  know  that  for  all  s  G  S, 


Jim  J(Vn)(s)  =  Jim  J(Vn)(s)  =  J(V.)(s)  =  g./r. 


Therefore, 


Jim  7n  =  Jim  %  =  gjr. 

n— *oo  n— *00 

Since  {7n}  is  increasing  and  {7,,}  is  decreasing,  we  obtain 

<  70  <  0./T  <  To  <  6+- 


For  the  second  part  of  the  proof,  suppose  g.  is  a  policy  such  that 


MVM  <  *+- 


Define  the  operator  Hy  by 


n„(V)(>)  =  V(.)  4-  J.(V)(s)  =  c(a,„W)  +  JsV(s’)Q(ds'  I  a, ,.(*))■ 
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Then,  for  all  a  6  5, 


B,(V)(s)  =  V(s)  +  UV)(s)  <  V(s)  +  6 + 
Hl(V)(s)  <  ff„(V)(s)  +  *+  <  V(s)  +  26+ 

HnAV){s)  <  F(a)  +  n«+ 


and  therefore 


,  ,  .  *Z(V) (■>) 

TibJs)  =  li  m  — - - 

n-*°°  n 


<  lim  |6+  + 

n— *oo  V 


The  fact  that  5.  <  ^(s)  is  obvious  and  this  yields  (35).  From  (35),  we  also  have  Vv(a)  ~ 
<7,  <  (£+  —  6~)/t,  and  this  completes  the  proof.  | 

In  the  next  section,  we  describe  a  general  approach  for  solving  the  functional  equations 
(14)  and  (30). 
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3.  Value  iteration  and  policy  iteration 


Value  iteration  and  policy  iteration  are  two  general  methods  for  solving  dynamic  programs 
like  those  described  in  the  previous  section.  They  operate  as  follows. 

Value  iteration. 

Select  initial  Vo  in  IB  j; 

For  n  :=  1  to  n  do 

Vn:=  T(Vn.t);  (37) 

Retain  p  such  that  T^(Vn)  =  T(Vn); 

End. 

Policy  iteration. 

Select  initial  policy  po; 

For  n  :=  1  to  n  do 


Policy  evaluation:  find  V  such  that 

T^(V)  =  V; 

(38) 

Policy  update:  find  pn  such  that 

TJY)  =  T(V); 

(39) 

Retain  ps; 

End. 

In  both  cases,  the  value  of  n  may  be  chosen  in  advance  or  depend  on  some  stopping 
criterion.  The  operators  T  and  are  defined  in  (8-9)  for  the  discounted  case,  and  in  (27, 
29)  for  the  average  cost  case. 

Value  iteration  converges  to  V.  for  all  the  models  studied  in  the  previous  section,  under 
conditions  C  or  LC  for  the  discounted  case  (taking  Vo  €  IB,),  or  under  Cl  and  C2  for  the 
average  cost  case.  But  for  policy  iteration,  there  might  be  intermediate  policies  p  (not 
optimal),  reached  at  some  iteration,  for  which  is  infinite  for  some  states  (under  LC),  or 
for  which  the  average  cost  depends  on  the  initial  state  (under  Cl  and  C2).  In  the  latter 
case,  for  the  average  cost  model,  p  is  periodic  or  multichain,  and  there  is  no  V  for  which 
r„(V)  =  V.  However,  there  exists  adaptations  of  the  policy  iteration  algorithm  that  can 
work  for  that  case  (see  e.g.  [10]). 

Obviously,  for  continuous  (or  very  large)  state  spaces,  these  algorithms  cannot  be  ap¬ 
plied  exactly  in  general.  Some  form  of  approximation  must  be  used.  Since  solving  (38)  is 
usually  too  difficult  when  the  state  space  is  very  large,  the  use  of  value  iteration  has  been 
advocated  for  that  case  [13,  32].  For  continuous  state  spaces,  Haurie  and  L’Ecuyer  [13]  (see 


also  [16, 18])  compute  (37)  at  a  finite  number  of  points  in  the  state  space  (using  numerical 
integration),  and  use  these  values  to  approximate  Vn  over  the  whole  state  space.  This 
approximation  is  used  in  the  next  iteration  and  the  process  is  repeated.  This  approach  is 
very  time  consuming. 

It  is  also  well  known  that  value  iteration  converges  linearly  (sometimes  very  slowly), 
while  policy  iteration  (when  it  works)  is  equivalent  to  applying  Newton’s  method  to  the 
equation  T(V )  —  V  =  0  (see  [26,  27]).  When  V  is  not  too  far  from  V,,  it  typically  has 
quadratic  convergence.  For  that  reason,  trying  to  adapt  policy  iteration  for  the  case  of 
large  state  spaces  has  been  a  subject  of  interest  in  the  recent  years.  One  adaptation  is  the 
so-called  “modified  policy  iteration”  method  [20,  26],  where  at  each  iteration,  (38)  is  solved 
only  approximately  by  applying  only  a  few  iterations  of  the  value  iteration  method  with  a 
fixed  policy  /in_i,  starting  from  the  previous  V,  In  the  next  section,  we  examine  this  idea, 
combined  with  a  finite  element  approximation  of  the  value  function,  in  the  general  setting 
of  section  2. 
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4.  A  finite  element  approach 

We  now  introduce  an  approximate  policy  iteration  algorithm,  with  finite  element  approx¬ 
imation  of  the  value  function.  Generally  speaking,  we  assume  that  the  value  function  can 
be  approximated  reasonably  well  by  a  linear  combination  of  a  small  set  of  (already  known) 
functions.  Typically,  these  functions  will  be  local ,  in  the  sense  that  their  support  will 
be  a  (small)  subregion  of  S  (with  some  exceptions  e.g.  for  the  case  of  unbounded  state 
spaces).  For  more  details  on  the  finite  element  method  in  general  and  specific  presentation 
of  various  finite  element  schemes,  see  e.g.  [14]. 

4.1.  Approximation  of  IBi  by  a  finite  dimensional  space 

Let  $  =  {f?i, . . .  ,Bj}  C  IBi  be  a  finite  set  of  linearly  independent  functions,  and  let  IBf 
be  the  subspace  of  IBi  spanned  by  the  Bj’s,  i.e.  IBf  is  the  set  of  functions  V  that  can  be 
expressed  as 

(40) 

i=l 

for  some  real  constants  d\, . . .  ,dj.  Let  tJ>  be  a.  measure  on  5,  such  that  -0(5)  =  fs  tp(ds)  is 
finite  and  strictly  positive.  For  each  policy  fi.  and  each  pair  (V-,  W)  in  IBi  x  IBj,  we  define 
the  scalar  product 

(V,  W),  =  /S(T„(V)  -  VX.)W(.)*(d.).  (4!) 

Note  that  TM(F)  =  V  V’(-)-almost  everywhere  in  S  is  equivalent  to  (V,  W )fl  =  0  for  all  W  in 
IBi.  Instead  of  solving  T^(V)  =  V,  we  will  examine  the  latter.  Solving  it  is  quite  difficult 
in  general  and  what  we  will  do  instead  is  to  solve  an  approximate  version  of  the  problem: 
find  V  in  IBf  such  that  (V,  W =  0  for  all  W  in  IBf ,  or  equivalently,  such  that  (V,  Bj)^  =0 
for  j  =  1 ,...,  J.  This  gives  rise  to  J  equations  with  J  unknowns  (the  unknowns  are  the 
coefficients  defining  V). 

4.2.  Building  a  linear  system 

If  we  replace  V  by  the  expression  (40)  in  the  equation  (F,  i?, )M  =  0,  we  obtain  the  equation 

j 

MO  +  X!  =  °»  (42) 

j-1 
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where  for  the  discounted  case,  we  have: 


MO  =  [c(s,fi(s))Bi{s)iJf(ds) 
J  s 


and 


m„ 


=  l  (-BM  +j[0mUS Bj(,')e-«Q(d(  x  d,'  |  ,,(.))]  B,(s)^ds), 

while  for  the  average  cost  case: 

MO  =  Js  (c(s,M3))  ~  c(5,^(5)))  B{(s)j>(ds) 


and 


=  JS(-BM+  JsBi(.,')Q( dj' |  *./>(*)) 

-JsBj(.')Q( d,'  1  i,M(»)))  a(»)l«<U) 

=  Js  (t(s,U»  (~BjM  +/sBj(Smo,co)  x  d.'  I  3,a«(3))) 

-  J  S.m,  »)  x  dV  |  5,/i(i))j  BifsJiHdj) 

(assuming  that  these  quantities  exist  and  are  finite).  For  a  fixed  n,  let  b  and  d  be  the 
column  vectors  b  =  (6M(1), . . .  ,&M(  J))'  and  d  =  {du . . .  ,dj)',  and  let  M  be  the  J  X  J 
matrix  whose  element  (i,j)  is  m^(i,j).  Note  that  b  and  M  depend  on  (j..  T. he  solution  of 
Tjy)  =  V  will  be  approximated  by  V  defined  in  (40),  where  d  is  a  solution  of 


b  +  Md  =  0. 


(43) 


4.3.  Approximate  policy  iteration 

In  general,  policies  must  also  be  approximated:  it  is  usually  not  possible  to  find  /i  such  that 
Tp(V)  =  T(V)  exactly.  As  we  did  for  the  state  space,  we  can  define  a  finite  dimensional 
subspace  of  the  policy  space  17,  and  consider  only  the  policies  that  belong  to  that  subspace. 
Giving  more  details  on  how  to  do  that  would  require  further  specification  of  the  structure 
of  the  action  space  A.  Note  that  A  is  not  necessarily  a  subset  of  1R.  In  some  cases,  it  can 
even  be  a  functional  space.  (The  action  a  can  be  for  instance  a  continuous  time  control  to 
be  applied  until  the  next  transition,  which  can  be  viewed  as  a  stopping  time.)  Since  this 
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is  rather  problem- dependent,  we  will  content  ourselves  with  the  following  approach.  For 
any  V  in  IBi  and  e  >  0,  define 

At(V)  =  {fieU\  T^V)(s)  <  T(V)(s)  +  c  for  all  s  G  5} .  (44) 

This  is  the  set  of  policies  for  which  for  each  state  s,  the  decision  fi(s)  brings  us  no  more 
than  e  away  from  the  infimumin  the  definition  of  T(V).  At  every  “policy  update”  step  of 
the  policy  iteration  algorithm  (equation  (39)),  we  will  in  fact  seek  a  new  policy  in  At(V'), 
for  some  value  of  e.  Often,  in  practice,  we  will  first  find  a  policy  /x  (by  “approximate” 
optimization)  and  then  estimate  the  smallest  e  for  which  fi  €E  Ae(V). 

Under  this  setting,  the  (approximate)  policy  iteration  algorithm  now  becomes: 

Algorithm  1.  (General  form) 

Select  e>0,  initial  policy  //,  and  initial  V  in  JBj; 

If  average  cost,  select  s  €  S; 

Loop 

Select  =  {Bi, ... ,  Bj }  C  B3i  and  the  measure  ^  on  S; 

Compute  b  and  M; 

Solve  approximately:  b  +  Md  =  0  for  d\ 

Define  V  6  IBf  by  equation  (40); 

Select  €i  and  find  new  fi  in  At,(U); 

If  desired,  perform  a  stopping  test: 

If  discounted  case,  use  e.g.  (18)  and  (20)  to  compute  or  estimate 
bounds  on  V,  and  V^:  —  V.  <  e; 

If  average  cost,  use  (35)  to  compute  or  estimate  bounds 

on  g *  and  tS~  <  g*  <  ^m(5)  —  T^+»  ?  =  (^+  —  ^-)/T» 

If  c  <  e,  or  other  stopping  criteria  satisfied,  stop; 

Endloop 

End. 

Obviously,  as  it  stands,  this  algorithm  is  not  completely  defined.  For  instance,  the 
stopping  criteria,  the  way  of  choosing  c,  ci,  /,  $  and  ip,  the  integration  method  used  to 
compute  b  and  M,  the  techniques  used  to  solve  the  linear  system  (perhaps  approximately), 
to  find  a  new  fi  (perform  the  minimization)  and  to  compute  (or  estimate)  e,  are  all  left 
open.  These  are  usually  problem  dependent.  In  practice,  they  may  vary  from  iteration 
to  iteration.  Often,  the  stopping  test  can  be  costly  and  should  not  be  performed  at  each 
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iteration.  Sometimes,  e  has  to  be  estimated  heuristically,  for  instance  as  in  [13]:  Recompute 
T(V)  and  TM(V)  at  a  large  number  of  new  points,  compute  the  approximation  error  at 
these  points,  and  take  the  largest  and  smallest  to  estimate  5+,  S~,  4>+(x),  4>~{x)  and  tj. 
The  function  W  in  theorem  2  can  be  taken  for  instance  as  T(V),  or  as  the  next  value  of 
V. 

In  practice,  instead  of  storing  the  retained  policy,  one  can  store  only  the  vector  d.  An 
appropriate  decision  for  a  given  state  &  can  be  recovered  as  needed  using  (40)  in  (8)  or 
(26).  The  selected  decision  must  be  one  for  which  the  minimum  is  attained  in  (8)  or  (26) 
(or,  if  approximations  are  used,  one  for  which  we  get  close  to  the  infimum). 

4.4.  A  useful  special  case 

One  particular  choice  for  the  measure  ^  is  to  select  a  finite  number  of  points  <Ti, . . .  ,07  in 
5,  a  set  of  positive  weights  and  for  each  S  subset  of  S,  define 

t(S)  =  Y 

bKe$} 

In  this  case,  the  scalar  product  becomes 

(V,W),  =  £(r„(V)  -  V)(<r,) W(«)Vi 

i= 1 

and  the  outer  integrals  in  the  definitions  of  &M(s)  and  are  replaced  by  sums. 

Typically,  when  using  a  finite  element  approach,  the  <r,’s  will  be  the  element  nodes.  We 
will  have  I  =  J  and  each  <pi  equal  to  1.  Suppose  moreover  that  Bj  :  S  — ►  [0,1]  for  each 
j,  Bj((n)  =  5{j  (the  Kronecker  delta),  and  £/=1  Bj(s)  =  1  for  all  s  in  5  (most  usual  finite 
element  schemes  have  these  properties  [14]). 

As  we  will  see  below,  one  interesting  point  of  this  special  case  is  that  usually,  the 
eigenvalues  of  M  are  such  that  the  system  (43)  can  be  solved  (approximately)  by  standard 
iterative  methods,  like  e.g.  Jacobi  or  Gauss-Siedel  iteration  [25].  Let  M  =  M  + 1,  where  I 
denotes  the  identity  matrix,  and  note  that  (43)  is  equivalent  to  d  =  b  +  Md.  One  iterative 
method,  called  pre-Jacobi,  is  simply  defined  by  applying  iteratively  the  affectation 

d:=b+Md,  (45) 

starting  with  some  initial  vector  d.  Moreover,  even  if  we  change  cr  and  $  between  two 
iterations,  it  is  easy  to  compute  the  value  of  d  corresponding  to  the  new  <r  (and  the  same  V ) 
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from  the  previous  one  and  to  continue  iterating  with  it.  More  specifically,  suppose  that  <r  = 
(oi , . . .  ,07)'  is  changed  to  cr  =  (5* , . . . , 5-# )'.  Then,  the  new  d  is  d  =  (dl} . . . ,  dx)\  where 
d{  =  V’(dv)  =  djBj(di).  Note  that  if  $  is  changed,  this  expression  is  independent  of 
the  new  $. 

Under  these  assumptions,  for  the  discounted  case,  we  obtain: 

j 

MO  =  J2c(<Tk,K(Tk))Bi(<Tk)  =  c(ffi,/x(<r,)), 

k=i 

mu(i,j)  =  ^2  (~Bj(<rk)+  [  Bj(s,)e~p<'Q(d(  x  ds'  j  cr*,/*^*)))  Bi(ak) 

k= 1  \  •'[O.ooJxS  / 

=  Sij  +  [  Bj(s')e~p(iQ(d £  x  ds'  |  <r,,/i(<r,)) 

J  [0,oo)  x  S 

>  -Sij 

and 

1  +  =  /  f  e_pCWC  x  ds'  |  <7„/x(o-.)) 

j= 1  j[0,oo)x5  yi=1  / 

=  f  e~p(Q(d(  x  ds'  |  <r,,/z(o\)) 

J[0,oo)x5 

<  I- 

Hence,  M  is  a  substochastic  matrix.  In  practice,  in  most  cases,  it  also  has  a  spectral 
radius  p(M)  <  1,  in  which  case  (45)  converges  geometrically.  But  there  might  be  policies 
p  for  which  M  has  a  spectral  radius  of  one.  These  (rare)  cases  correspond  to  policies  p  for 
which  the  total  expected  discounted  cost  for  the  discretized  model  becomes  infinite.  Such 
policies  are  never  optimal  (at  least  for  the  discretized  model)  and  if  they  are  encountered, 
it  is  most  likely  to  be  at  the  early  iterations. 

For  the  average  cost  case,  we  obtain  in  a  similar  fashion: 

MO  =  c(<ri,p(<n))  -  c(s,p(s)), 

=  -*ij  +  jsBj{s')Q(&s'  |  <n,p(<Ti))  -  Js  Bj{s')Q{ds'  |  s,p(s)). 

We  will  now  suppose  that  the  state  5  is  one  of  the  cr? s.  Without  loss  of  generality,  say 
S  =  <rj.  Define 

mM(t, j)  =  J  Bj(s')Q(ds'  |  <r -,-,/*(?«))• 
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Let  M  be  the  J  x  J  stochastic  matrix  formed  by  these  elements,  ?nd  let  m!  be  its  first 
row.  Then  the  iteration  (45)  can  be  rewritten  as 

g  :=  rh'd]  (46) 

d  :=  b  +  Md-l'g  (47) 

where  1  is  a  vector  of  ones.  The  scheme  (46,  47)  is  in  fact  a  relative  value  iteration  scheme 
applied  to  the  finite  state  discrete  time  Markov  chain  defined  by  the  transition  matrix  M 
and  cost  vector  b.  Its  convergence  is  geometric,  with  a  rate  determined  by  the  subdominant 
(second  largest  in  norm)  eigenvalue  of  M  (see  [21]),  so  long  as  this  subdominant  eigenvalue 
is  inside  the  unit  circle.  Since  M  is  stochastic,  its  largest  eigenvalue  is  always  1.  When  the 
norm  of  the  subdominant  one  is  also  1,  this  indicates  that  the  Markov  chain  corresponding 
to  M  is  multichain  or  periodic.  Periodicity  can  be  eliminated  by  taking  a  slightly  smaller 
value  of  r.  Multichain  matrices  can  still  be  encountered  (rarely)  during  the  iterations,  and 
we  will  mention  below  some  heuristic  ways  to  cope  with  that. 

For  any  V  in  IBi,  a  in  S 1  and  e  >  0,  define 

At(F,<r)  =  {fieU\  T„(F)(< t.)  <  T(V)(<n)  +  e  for  t  =  1, ...,/}  .  (48) 

This  is  the  set  of  policies  for  which  for  each  “selected”  state  <r the  decision  /i(crt)  brings 
us  no  more  than  e  away  from  the  infimum  in  the  definition  of  T(V). 

Under  this  setting,  algorithm  1  for  the  special  case  can  be  rewritten: 

Algorithm  1.  (Special  case) 

Select  e  >  0,  initial  policy  fi,  and  initial  V  in  Bi; 

If  average  cost,  select  5  G  S; 

Loop  (outer  loop): 

Select  J,  a  =  (<Ti, . . . ,  <r  j)'  G  SJ  and  $  =  {B\, .  ..,Bj}  C  such  that 
cr\  =  s,  Bj  :  S  — >  [0,1],  Bj(<Ti)  =  8{j  and  Bj(s)  =  1  for  all  s  6  5; 
Compute  6,  M  and  d  :=  (V(<7i), . . . ,  V(<tj))'-, 

Inner  loop:  select  k  and  repeat  k  times:  d  :=  M d  +  b ; 

Define  V  G  Bf  by  equation  (40); 

Select  ei  and  find  new  n  in  Atl(V,o-); 

If  desired,  perform  a  stopping  test:  (same  as  for  the  general  case); 

Endloop 

End. 
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For  k  =  1,  this  algorithm  becomes  the  value  iteration  (or  successive  approximation) 
algorithm,  as  described  in  [13].  For  k  =  oo,  we  get  policy  iteration.  Like  for  the  general 
version,  many  things  are  left  open.  A  good  choice  of  k  is  problem  dependent.  It  could  be 
chosen  adaptively,  based  on  the  previous  iterations.  Intuitively,  the  more  costly  it  is  to 
compute  b  and  M,  the  larger  the  value  of  k  should  be.  But  the  inner  loop  should  also  stop 
when  progress  gets  too  slow,  i.e.  when  d  is  not  changing  significantly  enough  anymore,  or 
if  d  does  not  appear  to  converge  geometrically. 

One  may  wonder  why  we  are  coming  back  to  a  linearly  convergent  iterative  algorithm 
to  (partially)  solve  the  linear  system,  in  the  policy  evaluation  step,  while  our  primary 
motivation  to  adopt  the  policy  iteration  algorithm  was  its  quadratic  convergence  rate! 
The  problem  is  that  for  large  state  spaces,  solving  the  linear  system  is  difficult.  However, 
the  gain  with  respect  to  straightforward  value  iteration  can  still  be  impressive  due  to  the 
fact  that  performing  one  iteration  of  the  inner  loop  (45)  is  usually  much  less  costly  than 
performing  the  iteration  (37).  In  algorithm  1  above,  the  numerical  integrations  to  compute 
M  are  performed  only  once  every  outer  loop.  In  value  iteration,  since  the  “minimizing” 
policy  usually  changes  every  iteration  (at  least  for  continuous  action  spaces),  together  with 
V,  the  integrals  must  be  recomputed.  This  often  accounts  for  most  of  the  computational 
costs. 

The  choice  of  cr  determines  a  grid  over  the  state  space  5  and  the  a' s  are  the  nodes 
of  the  finite  elements  [14].  Intuitively,  a  coarser  grid  should  be  chosen  at  the  early  stages 
of  the  algorithm  and  the  grid  should  be  refined  only  when  progress  is  stalling.  Multigrid 
techniques  [7]  can  also  be  used:  it  is  often  worthwhile  to  get  back  to  a  coarser  grid  to  make 
“corrections”  when  progress  becomes  too  slow  with  a  fine  grid.  Note  that  the  inner  loop 
can  be  supplemented  with  aggregation  steps,  using  e.g.  the  adaptive  aggregation  method 
proposed  in  [4,  5].  Various  other  techniques  for  the  iterative  solution  of  linear  systems  can 
also  be  used,  e.g.  overrelaxation,  reordering,  etc.  [25]. 

Many  good  choices  for  cr  and  $  axe  usually  available  and  details  on  this  are  covered 
by  a  voluminous  finite  element  literature.  Aboundant  software  also  exists,  some  of  which 
aimed  at  automatic  grid  construction.  However,  this  software  has  been  designed  primarily 
to  solve  partial  differential  equations.  Usually,  in  that  context,  the  scalar  product  is  not 
defined  as  in  section  4.2,  but  is  typically  bilinear,  symmetric,  and  possesses  some  other  nice 
properties  [14].  The  matrix  of  the  resulting  linear  system  is  usually  symmetric  and  sparse. 
Here,  M  is  not  symmetric  in  general.  Its  sparsity  depends  not  only  on  the  “locality”  of 
the  base  functions  B;,  but  also  on  the  stochastic  kernel  Q.  Row  i  of  M  will  be  sparse  if 
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the  set  of  states  reachable  in  one  transition  from  state  <7,  intersects  the  support  of  just  a 
few  of  the  Bj ’s. 

Normally,  the  bounds  on  V,  should  get  closer  at  each  iteration  of  the  outer  loop,  but 
this  is  not  guaranteed.  For  the  discounted  case,  from  theorems  1  and  2,  it  follows  that 
for  k  =  1,  if  the  error  of  approximation  of  T(V)  by  W  and  the  difference  between  W  and 
the  next  V  go  to  0  with  the  iteration  number,  then  ||V  —  V.||  converges  to  0.  For  version 
C  of  the  model,  if  the  sequence  of  values  of  ej  also  goes  to  0,  then  for  any  e  >  0,  an 
e-optimal  policy  is  obtained  after  a  finite  number  of  iterations  (see  also  [13,  theorem  3.3]). 
For  the  average  cost  case,  for  k  =  1,  the  algorithm  becomes  relative  value  iteration,  whose 
convergence  follows  from  condition  C2  (assuming  that  the  discretization  error  goes  to  0). 
In  general,  one  way  to  “insure”  convergence  is  to  adopt  the  following  rule:  whenever  the 
distance  between  the  bounds  on  V .  (in  terms  of  norm)  or  on  g.  is  not  diminishing  at  a  given 
iteration,  with  respect  to  the  best  value  obtained  before,  take  k  —  1  for  the  next  iteration. 
Of  course,  this  is  only  one  way  of  doing  it.  In  practice,  it  is  also  sometimes  possible  to 
detect  these  infinite  cost  or  multichain  policies  for  which  (45)  might  not  converge.  An 
alternative  heuristic  in  this  case  is  to  modify  them  slightly  so  that  p(M )  <  1  (discounted 
case)  or  to  make  M  unichain  (undiscounted  case).  Since  any  optimal  policy  should  have 
the  latter  property,  this  could  usually  be  done  without  impairing  the  algorithm. 

For  the  average  cost  case,  the  (two  step)  scheme  (46,  47)  is  usually  preferable  to  the 
direct  application  of  (45),  because  M is  usually  sparser  than  M .  In  the  algorithm,  we  have 
assumed  that  5  =  cr\ ,  but  of  course,  this  is  not  necessary.  Things  are  easier,  though,  if  s  is 
one  of  the  <r,’s,  since  g  can  then  be  computed  using  the  corresponding  row  of  M . 

4.5.  An  alternative  regression  approach 

For  a  finite  state  and  action  space  model,  Schweitzer  and  Seidmann  [34]  have  proposed  a 
regression  approach  for  polynomial  approximation  of  the  value  functior  A  direct  general¬ 
ization  of  their  approach  leads  to  the  following.  We  select  a  vector  of  states  a  =  (<ri , . . . ,  07), 
for  I  >  J,  and  write  the  expressions  T^(V^)((7,  )  —  V^tr,),  for  i  =  1,. . . ,/.  If  we  replace  V  by 
the  expression  (40)  in  these  expressions  and  integrate,  we  obtain  I  affine  forms  in  terms 
of  d\,  ’  •  •  ,dj.  In  matrix  form,  this  can  be  written  as  say  b  +  Md  (where  b  and  M  are  not 
the  same  as  in  the  previous  subsections). 

If  /  =  J,  (38)  could  be  replaced  by  the  linear  system  b  -f  Md  =  0.  If  /  >  J,  the  latter 
system  would  have  more  equations  than  unknowns.  Rather,  we  can  minimize  a  (possibly 
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weighted)  sum.  of  squares: 

(Md  +  b)'W(Md  +  b)  (49) 

where  W  —  diag(u>i , . . . ,  tu/)  is  a  diagonal  matrix  with  positive  diagonal  elements.  A  trivial 
choice  is  W  =  I  (the  identity  matrix),  which  gives  equal  weights  to  all  points.  The  sum  of 
squares  (49)  is  minimized  when 

M'Wb  +  M'WMd  =  0.  (50) 

Note  that  the  matrix  M'WM  is  symmetric.  In  the  previous  subsections,  the  matrices  M 
and  M  were  not  symmetric  in  general.  If  I  =  J  and  M~l  exists,  (50)  reduces  to  b+Md  =  0. 

In  [34],  S  is  assumed  to  be  finite  and  5  =  {07, . . .  ,  er/}.  In  that  case,  a  sufficient 
condition  for  M'WM  to  be  invertible  is  that  B\,. . .  ,Bj  are  linearly  independent  [34, 
lemma  1].  In  our  case,  this  result  does  not  hold  in  general,  but  a  sufficient  condition  is 
that  M  has  full  rank.  Indeed,  M'WM  not  invertible  means  that  there  exists  a  vector 
y  ^  0  such  that  M'WMy  =  0.  Hence,  y‘ M'W My  =  0  and  since  W  is  diagonal  with  only 
positive  elements  on  the  diagonal,  My  =  0,  which  means  that  M  is  not  of  full  rank.  Even 
when  M'WM  is  not  invertible,  a  solution  of  (50)  (that  minimizes  (49))  can  be  computed, 
but  the  solution  is  not  unique. 

Note  that  this  regression  approach  should  be  viewed  as  an  heuristic.  There  is  no 
guarantee  of  convergence  to  the  optimal  solution,  even  when  S  =  {<xi , . . . ,  <77 }.  Sometimes, 
the  solution  may  even  deteriorate  from  one  iteration  to  the  next.  Deciding  what  to  do 
when  e  is  not  getting  small  enough  is  essentially  a  matter  of  art.  A  problem  can  also 
occur  in  the  average  cost  case  if  a  multichain  policy  is  encountered  at  some  iteration.  One 
possible  (heuristic)  remedy  in  that  case  is  to  modify  the  policy  slightly  to  make  it  unichain 
(it  is  not  always  trivial,  though,  to  recognize  that  a  policy  is  multichain). 

Other  variants  of  this  approach  are  also  proposed  by  Schweitzer  and  Seidmann  [34] 
for  finite  state  and  decision  spaces:  linear  programming  (LP)  and  global  least-squares  fit 
(GLSF).  LP  might  work  well  for  relatively  small  state  and  decision  spaces.  GLSF  is  similar 
to  the  approach  that  we  have  described  in  this  section,  and  it  also  offers  a  guaranteed 
improvement  at  each  iteration.  However,  it  requires  much  more  work  per  iteration.  The 
basic  idea  is  that  after  solving  (50)  for  a,  one  performs  a  linear  search  along  the  direction 
that  links  this  d  and  the  previous  d,  to  minimize  (49)  where  M  and  b  depend  on  y,  y  is 
the  policy  for  which  the  minimum  is  attained  in  the  definition  of  T(V)}  and  V  is  defined 
by  (40).  Note  that  V,  y ,  M  and  b  vary  during  the  linear  search. 
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5.  Conclusion 


We  have  described  a  finite  element  approach  to  solve  MRDP  models  with  continuous  or 
very  large  state  spaces.  It  can  deal  with  most  reasonably  smooth  value  functions,  even  if 
we  don’t  have  an  a  priori  idea  of  their  shapes,  provided  that  the  state  space  is  bounded  and 
has  few  (continuous)  dimensions.  One  can  also  deal  with  certain  non  continuous  or  non 
differentiable  value  functions  by  a  proper  placement  of  the  element  boundaries.  For  high 
dimensional  state  spaces,  that  kind  of  approach  can  also  be  used  if  the  value  function  can 
be  approximated  reasonably  well  by  a  linear  combination  of  a  small  set  of  (a  priori  known) 
base  functions,  or  if  only  a  rough  approximation  is  sufficient.  Note  that  the  dimension 
of  the  state  space  can  sometimes  be  reduced  using  special  techniques.  For  instance,  Saad 
and  Turgeon  [31]  used  principal  component  analysis  to  define  a  linear  mapping  from  S  to 
a  smaller  dimensional  state  space  5,  and  defined  the  value  function  V  only  on  5. 

Defining  the  scalar  product  as  we  have  done  in  equation  (41)  is  only  one  way  of  doing 
it.  There  are  certainly  other  interesting  ways,  leading  to  different  linear  systems,  that 
should  be  explored. 

In  some  numerical  experiments,  speedups  by  factors  of  between  10  and  20  were  achieved 
by  using  the  modified  policy  iteration  algorithm  as  suggested  here,  compared  to  the  more 
naive  value  iteration  approach  described  in  [13]  with  an  optimization  step  at  each  iteration 
(using  similar  grids  and  finite  elements  in  both  cases).  See  [18,  19]  for  more  details. 
The  example  in  [19]  concerns  the  scheduling  of  a  robot  servicing  a  set  of  machines  on  a 
line.  It  arises  in  the  context  of  a  textile  mill,  where  the  machines  are  identical  winding 
heads.  The  state  space  in  that  case  is  comprised  of  a  finite  number  of  closed  intervals 
on  the  real  line.  On  each  interval,  the  (optimal)  value  function  is  continuous,  but  not 
differentiable.  In  [19],  it  was  approximated  by  a  piecewise  linear  function.  The  example 
in  [18]  deals  with  the  dynamic  optimization  of  checkpointing  times  for  database  systems. 
In  one  numerical  illustration,  the  state  space  is  the  union  of  a  finite  segment  of  the  real 
line,  with  a  rectangle  in  the  plane.  That  rectangle  was  partitioned  into  subrectangles,  and 
a  bilinear  approximating  function  was  used  on  each  subrectangle. 


26 


Acknowledgements 


Part  of  this  work  was  done  while  the  author  was  enjoying  the  hospitality  of  the  group 
“Meta-2”  at  INRIA,  Rocquencourt,  France.  Another  part  was  done  while  visiting  the  Op¬ 
erations  Research  Department  at  Stanford  University.  It  has  been  supported  by  NSERC- 
Canada  grant  #  A5463  and  FCAR-Quebec  grant  #  EQ2831.  The  author  wish  to  thank  J. 
P.  Quadrat  and  M.  Goursat  for  valuable  suggestions,  and  M.  Mayrand  for  his  contribution 
to  the  ideas  of  section  4.4. 


References 

[1]  Bellman,  R.,  Kalaba,  R.  and  Kotkin,  B.  “Polynomial  Approximation  —  A  New 
Computational  Technique  in  Dynamic  Programming”,  Math,  of  Computation ,  17, 
8  (1963),  155-161. 

[2]  Bertsekas,  D.  P.  “Convergence  of  Discretization  Procedures  in  Dynamic  Program¬ 
ming”,  IEEE  Trans,  on  Automatic  Control ,  AC-20  (1975),  415-419. 

[3]  Bertsekas,  D.  P.  and  Shreve,  S.  E.  Stochastic  Optimal  Control:  The  Discrete  Time 
Case.  Academic  Press,  New- York,  1978. 

[4]  Bertsekas,  D.  P.  Dynamic  Programming:  Deterministic  and  Stochastic  Models.  Pren¬ 
tice  Hall,  1987. 

[5]  Bertsekas,  D.  P.  and  Castahon,  D.  A.  “Adaptive  Aggregation  for  Infinite  Horizon 
Dynamic  Programming”,  IEEE  Transactions  on  Automatic  Control,  34,  6  (1989), 
589-598. 

[6]  Breton,  M.  and  L’Ecuyer,  P.  “Noncooperative  Stochastic  Games  Under  a  N-stage 
Local  Contraction  Assumption”,  Stochastics,  26  (1989),  227-245. 

[7]  Briggs,  W.  L.  A  Multigrid  Tutorial,  SIAM,  Philadelphia,  1987. 

[8]  Daniel,  J.  W.  “Splines  and  Efficiency  in  Dynamic  Programming”,  J.  Math.  Anal. 
Appl,  54  (1976),  402-407. 

[9]  Denardo,  E.  V.  “Contraction  Mappings  in  the  Theory  Underlying  Dynamic  Program¬ 
ming”,  SIAM  Review ,  9  (1967),  165-177. 


[10]  Federgruen,  A.  and  Spreen,  D.  “A  New  Specification  of  the  Multichain  Policy  Iteration 
Algorithm  in  Undiscounted  Markov  Renewal  Programs”,  Management  Science,  26, 
12  (1980),  1211-1217. 

[11]  Fox,  B.  L.  “Discretizing  Dynamic  Programs”,  J.  Optim.  Theory  and  Appl.,  II  (1973), 
228-234. 

[12]  Haurie,  A.  and  L’Ecuyer,  P.  “A  Stochastic  Control  Approach  to  Group  Preventive 
Replacement  in  a  Multicomponent  System”,  IEEE  Trans,  on  Automatic  Control, 
AC-27,  2  (1982),  387-393. 

[13]  Haurie,  A.  and  L’Ecuyer,  P.  “Approximation  and  Bounds  in  Discrete  Event  Dynamic 
Programming”,  IEEE  Transactions  on  Automatic  Control,  AC-31,  3  (1986),  227-235. 

[14]  Hugues,  T.  J.  R.,  The  Finite  Element  Method:  Linear  Static  and  Dynamic  Finite 
Element  Analysis,  Prentice-Hall,  Englewood,  New  Jersey,  1987. 

[15]  L’Ecuyer,  P.  and  Haurie,  A.  “Discrete  Event  Dynamic  Programming  in  Borel  Spaces 
with  State  Dependent  Discounting”,  Report  no.  DIUL-RR-8309,  Departement  d’lh- 
formatique,  Univ.  Laval,  1983. 

[16]  L’Ecuyer,  P.  and  Haurie,  A.  “The  Repair  vs  Replacement  Problem:  A  Stochastic 
Control  Approach”,  Optimal  Control  Applications  and  Methods,  8  (1987),  219-230. 

[17]  L’Ecuyer,  P.  and  Haurie,  A.  “Discrete  Event  Dynamic  Programming  with  Simultane¬ 
ous  Events”,  Math,  of  Oper.  Research ,  13,  1  (1988),  152-163. 

[18]  L’Ecuyer,  P.  and  Malenfant,  J.  “Computing  Optimal  Checkpointing  Strategies  for 
Rollback  and  Recovery  Systems”,  IEEE  Trans,  on  Computers ,  C-37,  4  (1988),  491- 
496. 

[19]  L’Ecuyer,  P.,  Mayrand,  M.  and  Dror,  M.  “Dynamic  Scheduling  of  a  Robot  Servicing 
Machines  on  a  One-Dimensional  Line”,  submitted  for  publication,  1988. 

[20]  Morton,  T.  E.,  “Undiscounted  Markov  Renewal  Programming  Via  Modified  Successive 
Approximation”,  Oper.  Res.,  19  (1971),  1081-1089. 

[21]  Morton,  T.  E.  and  Wecker,  W.  E.,  “Discounting,  Ergodicity  and  Convergence  for 
Markov  Decision  Processes”,  Managament  Science,  23,  8  (1977),  890-900. 


28 


[22]  Morin,  T.  L.  “Computational  Advances  in  Dynamic  Programming”,  in  Dynamic  Pro¬ 
gramming  and  its  Applications ,  M.  L.  Puterman  Ed.,  Academic  Press,  New- York 
(1978),  53-90. 

[23]  Platzman,  L.  “Improved  Conditions  for  Convergence  in  Undiscounted  Markov  Re¬ 
newal  Programming”,  Operations  research ,  25  (1977),  529-533. 

[24]  Porteus,  E.  L.  “Bounds  and  Transformations  for  Finite  Markov  Decision  Chains”, 
Operations  Research ,  23  (1975),  761-784. 

[25]  Porteus,  E.  L.  “Computing  the  Discounted  Return  in  Markov  and  Semi-Markov 
Chains”,  Nav.  Res.  Log.  Quart.,  28,  4  (1981),  567-578. 

[26]  Puterman,  M.  L.  and  Shin,  M.  C.  “Modified  Policy  Iteration  Algorithms  for  Dis¬ 
counted  Markov  Decision  Problems”,  Management  Science ,  24, 11  (1978),  1127-1137. 

[27]  Puterman,  M.  L.  and  Brumelle,  S.  L.  “On  the  Convergence  of  Policy  Iteration  in 
Stationary  Dynamic  Programming”,  Math,  of  Oper.  Research ,  4  (1979),  60-69. 

[28]  Rishel,  R.  “Group  Preventive  Maintenance:  An  Example  of  Controlled  Jump  Pro¬ 
cesses”,  Proc.  20th  IEEE  Conf.  on  Decision  and  Control,  San  Diego,  CA,  Dec. 
(1981),  786-791. 

[29]  Ross,  S.  M.  Applied  Probability  Models  with  Optimization  Applications,  Holden  Day, 
1970. 

[30]  Ross,  S.  M.  “Average  Cost  Semi- Markov  Decision  Processes”,  J.  Applied  Probability , 
7  (1970),  649-656. 

[31]  Saad,  M.  and  Turgeon,  A.,  “Application  of  Principal  Component  Analysis  to  Long- 
Term  Reservoir  Management”,  Water  Resources  Research,  24,  7  (1988),  907-912. 

[32]  Schweitzer,  P.  J.  “Iterative  Solution  of  the  Functional  Equations  of  Undiscounted 
Markov  Renewal  Programming”,  J.  Math.  Anal.  Appl.,  34  (1971),  495-501. 

[33]  Schweitzer,  P.  J.  “On  the  Existence  of  Relative  Values  for  Undiscounted  Markovian 
Decision  Processes  with  a  Scalar  Gain  Rate”,  J.  Math.  Anal.  Appl.,  104  (1984), 
67-78. 

[34]  Schweitzer,  P.  J.  and  Seidmann,  A.  “Generalized  Polynomial  Approximations  in 
Markovian  Decision  Processes”,  J.  Math.  Anal.  Appl.,  110  (1985),  568-582. 


29 


ssssasss* 


v>W:’> 


[35]  Whitt,  W.  “Approximations  of  Dynamic  Programs  I  and  II”,  Math,  of  Oper.  Re¬ 
search,  3  (1978),  231-243,  and  4  (1979),  179-185. 


30 


